Prioritized Experience Replay

1 Overview

Prioritized experience replay was proposed by T. Schaul et al. in "Prioritized Experience Replay" (ICLR 2016) and is widely used to speed up reinforcement learning.

Roughly speaking, transitions with larger prediction errors are replayed more frequently. To compensate for the distorted sampling probability, the learning weight is scaled in the opposite direction (cf. importance sampling).

In cpprb, the PrioritizedReplayBuffer class implements this functionality with proportional-based (instead of rank-based) prioritization.

For the \(i\)-th transition, the sampling probability and the importance-sampling weight are defined as follows:

Probability: \( P(i) = \frac{(p_{i}+\epsilon)^{\alpha}}{\sum_{j} (p_{j}+\epsilon)^{\alpha}} \)
Weight: \( w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta} \)

where \( p_i \) is the priority of the \(i\)-th transition and \( N \) is the number of stored transitions.
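As a quick illustration of these formulas (plain NumPy, not cpprb internals; the priority values below are made up):

import numpy as np

p = np.array([1.0, 2.0, 0.5, 4.0])   # raw priorities p_i (e.g. absolute TD errors)
alpha, beta, eps = 0.6, 0.4, 1e-4

scaled = (p + eps) ** alpha
P = scaled / scaled.sum()             # sampling probabilities P(i)
w = (1.0 / (len(p) * P)) ** beta      # importance-sampling weights w_i
w /= w.max()                          # optional: normalize by the max weight, as in the original paper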

You can pass priorities to add together with the other environment values. If no priorities are passed, the stored maximum priority is used instead.
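For example, a minimal sketch of attaching an explicit priority at add time (assuming add accepts a priorities keyword, as in cpprb's documented usage):

import numpy as np
from cpprb import PrioritizedReplayBuffer

rb = PrioritizedReplayBuffer(32, {"obs": {"shape": (4,)}, "done": {}})

# Explicit priority for this transition; omitting the keyword would fall back
# to the stored maximum priority.
rb.add(obs=np.zeros(4), done=0, priorities=0.8)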

The dict returned by sample also contains the special keys indexes and weights. The indexes are intended to be passed back to update_priorities, together with freshly computed priorities, once the sampled transitions have been compared against the new predictions.

PrioritizedReplayBuffer takes the hyperparameters alpha and eps in its constructor and beta in sample; their default values are 0.6, 1e-4, and 0.4, respectively. The details are described in the original paper cited above.

Parameter        Default          Description
\( \alpha \)     \( 0.6 \)        Prioritization exponent. 0 means uniform sampling.
\( \beta \)      \( 0.4 \)        Compensation exponent. 1 means full compensation (i.e. importance sampling).
\( \epsilon \)   \( 10^{-4} \)    Small value added to priorities to avoid a zero probability.
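For instance, a sketch with the defaults spelled out explicitly (equivalent to omitting them):

import numpy as np
from cpprb import PrioritizedReplayBuffer

prb = PrioritizedReplayBuffer(256,
                              {"obs": {"shape": (4,)}, "rew": {}, "done": {}},
                              alpha=0.6,   # constructor-time prioritization exponent
                              eps=1e-4)    # constructor-time epsilon
prb.add(obs=np.zeros(4), rew=0.0, done=0)
batch = prb.sample(1, beta=0.4)            # sample-time compensation exponent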

2 Example Usage

import numpy as np
from cpprb import PrioritizedReplayBuffer

buffer_size = 256

# The second argument describes each value stored per transition (its shape, etc.).
prb = PrioritizedReplayBuffer(buffer_size,
                              {"obs": {"shape": (4,4)},
                               "act": {"shape": 3},
                               "rew": {},
                               "next_obs": {"shape": (4,4)},
                               "done": {}},
                              alpha=0.5)

# Fill the buffer with dummy transitions.
for i in range(1000):
    prb.add(obs=np.zeros((4,4)),
            act=np.ones(3),
            rew=0.5,
            next_obs=np.zeros((4,4)),
            done=0)

batch_size = 32

# Sample a prioritized batch; beta controls the importance-sampling compensation.
s = prb.sample(batch_size, beta=0.5)

# Indexes and importance-sampling weights of the sampled transitions.
indexes = s["indexes"]
weights = s["weights"]

#  train
#  ...

# Update the priorities of the sampled transitions
# (dummy values here; in practice e.g. absolute TD errors).
new_priorities = np.ones_like(weights)
prb.update_priorities(indexes, new_priorities)

3 Notes

4 Technical Detail

To choose prioritized samples efficiently, the partial sums and the minimum of the pre-calculated weights are stored in a Segment Tree data structure, which is written in C++ and was an initial main motivation of this project.
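The sum-tree idea can be sketched in a few lines of Python (illustration only; cpprb's actual Segment Trees are implemented in C++ and also track a minimum for the weight computation, and the names below are made up):

import numpy as np

class SumTree:
    """Minimal sum segment tree: O(log N) priority updates and O(log N) prefix-sum search."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)     # leaves live at indices [capacity, 2*capacity)

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:                          # propagate partial sums up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, value):
        """Return the index whose cumulative-priority interval contains value (0 <= value < total)."""
        i = 1
        while i < self.capacity:               # descend from the root
            left = 2 * i
            if value < self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity

    @property
    def total(self):
        return self.tree[1]                    # sum of all priorities

tree = SumTree(8)
for i, p in enumerate([1.0, 2.0, 0.5, 4.0]):
    tree.update(i, p)
idx = tree.sample(np.random.uniform(0.0, tree.total))   # proportional sampling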

To support multiprocessing, the Segment Tree can also be updated lazily.
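Building on the SumTree sketch above, a conceptual sketch of such lazy updating (not cpprb's actual code) could defer priority writes until they are needed:

class LazySumTree(SumTree):
    """Buffer priority writes and fold them into the tree only when needed."""
    def __init__(self, capacity):
        super().__init__(capacity)
        self.pending = {}                      # index -> latest priority

    def update(self, idx, priority):
        self.pending[idx] = priority           # cheap write, no tree traversal yet

    def flush(self):
        for idx, priority in self.pending.items():
            super().update(idx, priority)      # propagate partial sums now
        self.pending.clear()

A sampler would call flush() once before reading total or drawing a batch, so that many priorities can be recorded cheaply in between.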